import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Closed-form solution (the Normal Equation) that minimizes the cost function for Linear Regression:
$\theta = (X^T X)^{-1} X^T y$
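To make the formula above concrete, here is a minimal sketch on a tiny noise-free dataset. The toy data and the names `X_toy`, `y_toy`, `theta_toy` and `theta_lstsq` are assumptions introduced only for this illustration. Because the data follow y = 1 + 2x exactly, the Normal Equation should recover those two coefficients.

```python
import numpy as np

# Hypothetical toy data: y = 1 + 2x exactly, with no noise.
x = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = 1.0 + 2.0 * x

# Prepend a column of ones so that theta[0] acts as the intercept.
X_toy = np.c_[np.ones((len(x), 1)), x]

# Normal Equation: theta = (X^T X)^{-1} X^T y
theta_toy = np.linalg.inv(X_toy.T @ X_toy) @ X_toy.T @ y_toy

# np.linalg.lstsq solves the same least-squares problem
# without forming an explicit inverse.
theta_lstsq, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)

print(theta_toy.ravel())    # intercept, slope
print(theta_lstsq.ravel())
```

Inverting $X^T X$ directly is fine for small, well-conditioned problems; a least-squares solver is usually the safer choice when features are nearly collinear.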
# Generating data for this equation
X = np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
Now let’s compute θ using the Normal Equation. We will use the inv() function from NumPy’s Linear Algebra module (np.linalg) to compute the inverse of a matrix, and the dot() method for matrix multiplication:
X_b = np.c_[np.ones((100, 1)), X] # add x0=1 to each instance
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
The actual function that we used to generate the data is $y = 4 + 3x_1$ + Gaussian noise. Let’s see what the equation found:
print(theta_best)
Now you can make predictions using θ:
X_new = np.array([[0], [2]])
X_new_b = np.c_[np.ones((2, 1)), X_new]
y_predict = X_new_b.dot(theta_best)
Plotting the model's predictions:
plt.plot(X_new, y_predict, "r-")
plt.plot(X, y, "b.")
plt.axis([0, 2, 0, 15])
# We can do the same using sklearn
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
print(lin_reg.intercept_, lin_reg.coef_)
Another way to train a Linear Regression model: Gradient Descent
The general idea of Gradient Descent is to tweak parameters iteratively in order to minimize a cost function.
Why does Gradient Descent always work for Linear Regression? Because of the cost function (MSE, mean squared error). The MSE cost function for a Linear Regression model happens to be a convex function, which means that if you pick any two points on the curve, the line segment joining them never crosses the curve. This implies that there are no local minima, just one global minimum. It is also a continuous function with a slope that never changes abruptly. These two facts have a great consequence: Gradient Descent is guaranteed to approach arbitrarily close to the global minimum (if you wait long enough and if the learning rate is not too high).
Once you have the gradient vector, which points uphill, just go in the opposite direction to go downhill. This means subtracting $\nabla_\theta \mathrm{MSE}(\theta)$ from θ. This is where the learning rate η comes into play: multiply the gradient vector by η to determine the size of the downhill step.
$\theta \leftarrow \theta - \eta \, \nabla_\theta \mathrm{MSE}(\theta)$
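For reference, the gradient used in this update can be written out explicitly. The block below is a sketch of the standard result, assuming the MSE is defined as the mean of squared residuals over the m training instances, with X the feature matrix that includes the bias column:

```latex
\mathrm{MSE}(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( \theta^T x^{(i)} - y^{(i)} \right)^2
\qquad\Longrightarrow\qquad
\nabla_\theta \mathrm{MSE}(\theta) = \frac{2}{m} X^T \left( X\theta - y \right)
```

The factor 2/m is the same one that multiplies the gradient in the batch implementation.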
eta = 0.1  # learning rate
n_iterations = 1000
m = 100  # number of training examples
theta = np.random.randn(2, 1)  # random initialization
for iteration in range(n_iterations):
    gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)  # gradient of the MSE over the full training set
    theta = theta - eta * gradients
This is exactly what the normal equation found!
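The agreement between the two approaches can be checked numerically. The following is a minimal, self-contained sketch: the helper name `batch_gradient_descent`, the `_demo` variables and the fixed seeds are assumptions made for this illustration, not part of the notebook above.

```python
import numpy as np

def batch_gradient_descent(X_b, y, eta=0.1, n_iterations=1000, seed=0):
    """Minimize the MSE with full-batch gradient steps; returns theta of shape (n, 1)."""
    rng = np.random.default_rng(seed)
    m, n = X_b.shape
    theta = rng.standard_normal((n, 1))  # random initialization
    for _ in range(n_iterations):
        gradients = 2 / m * X_b.T @ (X_b @ theta - y)
        theta = theta - eta * gradients
    return theta

# Same kind of data as above: y = 4 + 3x + Gaussian noise.
rng = np.random.default_rng(42)
X_demo = rng.random((100, 1))
y_demo = 4 + 3 * X_demo + rng.standard_normal((100, 1))
X_demo_b = np.c_[np.ones((100, 1)), X_demo]

theta_gd = batch_gradient_descent(X_demo_b, y_demo)
theta_ne = np.linalg.inv(X_demo_b.T @ X_demo_b) @ X_demo_b.T @ y_demo

# Largest absolute difference between the two solutions.
print(np.abs(theta_gd - theta_ne).max())
```

With a learning rate that is too high the iterates diverge instead, so the match depends on the choice of `eta` and on running enough iterations.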
# SGD using a simple learning schedule
n_epochs = 50
t0, t1 = 5, 50  # learning schedule hyperparameters
def learning_schedule(t):
    return t0 / (t + t1)
theta = np.random.randn(2, 1)  # random initialization
for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)
        xi = X_b[random_index:random_index+1]
        yi = y[random_index:random_index+1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)  # no 1/m factor: only one training example
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients
By convention we iterate by rounds of m iterations; each round is called an epoch. While the Batch Gradient Descent code iterated 1,000 times through the whole training set, this code goes through the training set only 50 times and reaches a fairly good solution.
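To see how quickly the step size shrinks under this schedule, the sketch below evaluates it at the first step of a few epochs. It reuses the same `t0`, `t1` and `m` values as the SGD code above; the choice of which epochs to print is arbitrary.

```python
t0, t1 = 5, 50  # same hyperparameters as the schedule above

def learning_schedule(t):
    return t0 / (t + t1)

m = 100  # training set size used above
for epoch in (0, 1, 10, 49):
    step = epoch * m  # index of the first step in this epoch
    print(f"epoch {epoch:2d}: eta = {learning_schedule(step):.5f}")
```

The learning rate starts at t0 / t1 and decays roughly like 1 / t, which lets the early steps move quickly while the later ones settle near the minimum.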
To perform Linear Regression using SGD with Scikit-Learn, you can use the SGDRegressor class, which defaults to optimizing the squared error cost function. The following code runs 50 epochs, starting with a learning rate of 0.1 (eta0=0.1), using the default learning schedule (different from the preceding one), and it does not use any regularization (penalty=None; more details on this shortly):
from sklearn.linear_model import SGDRegressor
sgd_reg = SGDRegressor(max_iter=50, penalty=None, eta0=0.1)
sgd_reg.fit(X, y.ravel())
print(sgd_reg.intercept_, sgd_reg.coef_)
Let’s generate some nonlinear data, based on a simple quadratic equation (with some noise):
m = 100
X = 6*np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)
$y = 0.5x^2 + x + 2$ + Gaussian noise
# visualising this data
plt.plot(X, y, "b.")
plt.axis([-3, 3, 0, 10])
Clearly, a straight line will never fit this data properly. So let’s use Scikit-Learn’s PolynomialFeatures class to transform our training data, adding the square (2nd-degree polynomial) of each feature in the training set as new features (in this case there is just one feature):
PolynomialFeatures: Generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. For example, if an input sample is two dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].
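The [a, b] example can be reproduced with a few lines of plain Python. This is only a sketch that mimics the ordering described above; the helper `polynomial_terms` is hypothetical and is not how scikit-learn implements the transformer.

```python
from itertools import combinations_with_replacement
from math import prod

def polynomial_terms(sample, degree):
    """Return all monomials of the inputs with total degree <= `degree`, bias term first."""
    terms = []
    for d in range(degree + 1):
        # Each combination picks d input indices (with repetition) to multiply together.
        for combo in combinations_with_replacement(range(len(sample)), d):
            terms.append(prod(sample[i] for i in combo))
    return terms

a, b = 2, 3
print(polynomial_terms([a, b], degree=2))  # [1, 2, 3, 4, 6, 9], i.e. [1, a, b, a^2, ab, b^2]
```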
from sklearn.preprocessing import PolynomialFeatures
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
print('X[0]=', X[0])
print('X_poly[0]=', X_poly[0])
X_poly now contains the original feature of X plus the square of this feature. Now you can fit a LinearRegression model to this extended training data:
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
print(lin_reg.intercept_, lin_reg.coef_)
X_new=np.linspace(-3, 3, 100).reshape(100, 1)
X_new_poly = poly_features.transform(X_new)
y_new = lin_reg.predict(X_new_poly)
plt.plot(X, y, "b.")
plt.plot(X_new, y_new, "r-", linewidth=2, label="Predictions")
plt.axis([-3, 3, 0, 10])
Note that when there are multiple features, Polynomial Regression is capable of finding relationships between features (which is something a plain Linear Regression model cannot do). This is made possible by the fact that PolynomialFeatures also adds all combinations of features up to the given degree. For example, if there were two features a and b, PolynomialFeatures with degree=3 would not only add the features $a^2$, $a^3$, $b^2$, and $b^3$, but also the combinations $ab$, $a^2b$, and $ab^2$.
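Because every combination is added, the number of features grows quickly with the degree. A short sketch of the count, using the standard combinatorial result that there are C(n + d, d) monomials of total degree at most d in n variables (the helper name `n_polynomial_features` is an assumption for this illustration):

```python
from math import comb

def n_polynomial_features(n_features, degree, include_bias=False):
    """Count the monomials of total degree <= `degree` in `n_features` variables."""
    total = comb(n_features + degree, degree)  # this count includes the constant (bias) term
    return total if include_bias else total - 1

# Two features, degree 3: a, b, a^2, ab, b^2, a^3, a^2 b, a b^2, b^3
print(n_polynomial_features(2, 3))   # 9
print(n_polynomial_features(10, 3))  # 285
```

Going from 2 to 10 input features at degree 3 already takes the output from 9 to 285 columns, so high degrees on wide datasets should be used with care.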
# The following code defines a function that plots the learning curves of a model given some training data:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
X = np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
def plot_learning_curve(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])  # train on the first m instances only
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train[:m], y_train_predict))
        val_errors.append(mean_squared_error(y_val, y_val_predict))
    plt.xlabel('Training Set Size')
    plt.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="train")  # training RMSE
    plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="val")  # validation RMSE
    plt.legend()
# learning curve for linear regression
lin_reg = LinearRegression()
plot_learning_curve(lin_reg, X, y)
from sklearn.pipeline import Pipeline
X = np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
polynomial_regression = Pipeline([
    ("poly_features", PolynomialFeatures(degree=10, include_bias=False)),
    ("lin_reg", LinearRegression()),
])
plot_learning_curve(polynomial_regression, X, y)
plt.axis([0, 80, 0, 3])
NOTE: This diagram is not very clear
Here is the closed form solution for Ridge Regression
The regularization term is half the square of the ℓ2 norm
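A minimal sketch of the closed-form solution mentioned above, assuming the common formulation in which minimizing $\|X\theta - y\|^2 + \alpha \sum_{i \ge 1} \theta_i^2$ gives $\theta = (X^T X + \alpha A)^{-1} X^T y$, where A is the identity matrix with a 0 in the top-left corner so that the bias term is not regularized. How α is scaled relative to an MSE-based cost differs between texts and libraries, so treat the exact constant as an assumption. The helper `ridge_closed_form` and the `_demo` data are hypothetical names used only here.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve (X_b^T X_b + alpha * A) theta = X_b^T y for theta."""
    X_b = np.c_[np.ones((X.shape[0], 1)), X]  # add the bias column
    A = np.eye(X_b.shape[1])
    A[0, 0] = 0.0                             # do not penalize the intercept
    return np.linalg.solve(X_b.T @ X_b + alpha * A, X_b.T @ y)

rng = np.random.default_rng(0)
X_demo = 3 * rng.random((20, 1))
y_demo = 1 + 0.5 * X_demo + rng.standard_normal((20, 1)) / 1.5

theta_ols = ridge_closed_form(X_demo, y_demo, alpha=0.0)      # no regularization
theta_ridge = ridge_closed_form(X_demo, y_demo, alpha=100.0)  # strong regularization
print(theta_ols.ravel(), theta_ridge.ravel())
```

With alpha=0 the result is the ordinary least-squares solution; increasing alpha pulls the slope toward zero while leaving the intercept free to adjust.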
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
m = 20
X = 3 * np.random.rand(m, 1)
y = 1 + 0.5 * X + np.random.randn(m, 1) / 1.5
X_new = np.linspace(0, 3, 100).reshape(100, 1)
def plot_model(model_class, polynomial, alphas, **model_kargs):
    for alpha, style in zip(alphas, ("b-", "g--", "r:")):
        # alpha == 0 means no regularization, so fall back to plain Linear Regression
        model = model_class(alpha, **model_kargs) if alpha > 0 else LinearRegression()
        if polynomial:
            model = Pipeline([
                ("poly_features", PolynomialFeatures(degree=10, include_bias=False)),
                ("std_scaler", StandardScaler()),
                ("regul_reg", model),
            ])
        model.fit(X, y)
        y_new_regul = model.predict(X_new)
        lw = 2 if alpha > 0 else 1
        plt.plot(X_new, y_new_regul, style, linewidth=lw, label=r"$\alpha = {}$".format(alpha))
    plt.plot(X, y, "b.", linewidth=3)
    plt.legend(loc="upper left", fontsize=15)
    plt.xlabel("$x_1$", fontsize=18)
    plt.axis([0, 3, 0, 4])
plt.figure(figsize=(8, 4))
plt.subplot(121)  # left panel: linear models
plot_model(Ridge, polynomial=False, alphas=(0, 10, 100), random_state=42)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.subplot(122)  # right panel: degree-10 polynomial models
plot_model(Ridge, polynomial=True, alphas=(0, 10**-5, 1), random_state=42)
# Ridge Regression using sklearn
from sklearn.linear_model import Ridge
ridge_reg = Ridge(alpha=1, solver="cholesky", random_state=42)
ridge_reg.fit(X, y)  # uses the X, y defined in the above cell
# using SGD
sgd_reg = SGDRegressor(max_iter=50, tol=None, penalty="l2", random_state=42)  # tol=None disables early stopping
sgd_reg.fit(X, y.ravel())
The penalty hyperparameter sets the type of regularization term to use. Specifying "l2" indicates that you want SGD to add a regularization term to the cost function equal to half the square of the ℓ2 norm of the weight vector: this is simply Ridge Regression.
ridge_reg = Ridge(alpha=1, solver="sag", random_state=42)
ridge_reg.fit(X, y)
from sklearn.linear_model import Lasso
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
from sklearn.linear_model import ElasticNet
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5, random_state=42)
elastic_net.fit(X, y)
# Logistic Regression using the iris classification example
from sklearn import datasets
iris = datasets.load_iris()
X = iris['data'][:, 3: ] # petal width
y = (iris['target'] == 2).astype(int)  # 1 if Iris-Virginica, else 0
# Now let’s train a Logistic Regression model:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X, y)
# Let’s look at the model’s estimated probabilities for flowers with petal widths varying from 0 to 3 cm
X_new = np.linspace(0, 3, 1000).reshape(-1, 1)
y_proba = log_reg.predict_proba(X_new)
plt.plot(X_new, y_proba[:, 1], "g-", label="Iris-Virginica")
plt.plot(X_new, y_proba[:, 0], "b--", label="Not Iris-Virginica")
# A fancier plot
X_new = np.linspace(0, 3, 1000).reshape(-1, 1)
y_proba = log_reg.predict_proba(X_new)
decision_boundary = X_new[y_proba[:, 1] >= 0.5][0]
plt.figure(figsize=(8, 3))
plt.plot(X[y==0], y[y==0], "bs")
plt.plot(X[y==1], y[y==1], "g^")
plt.plot([decision_boundary, decision_boundary], [-1, 2], "k:", linewidth=2)
plt.plot(X_new, y_proba[:, 1], "g-", linewidth=2, label="Iris-Virginica")
plt.plot(X_new, y_proba[:, 0], "b--", linewidth=2, label="Not Iris-Virginica")
plt.text(decision_boundary+0.02, 0.15, "Decision boundary", fontsize=14, color="k", ha="center")
plt.arrow(decision_boundary, 0.08, -0.3, 0, head_width=0.05, head_length=0.1, fc='b', ec='b')
plt.arrow(decision_boundary, 0.92, 0.3, 0, head_width=0.05, head_length=0.1, fc='g', ec='g')
plt.xlabel("Petal width (cm)", fontsize=14)
plt.ylabel("Probability", fontsize=14)
plt.legend(loc="center left", fontsize=14)
plt.axis([0, 3, -0.02, 1.02])
from sklearn.linear_model import LogisticRegression
X = iris["data"][:, (2, 3)] # petal length, petal width
y = (iris["target"] == 2).astype(int)  # 1 if Iris-Virginica, else 0
log_reg = LogisticRegression(solver="liblinear", C=10**10, random_state=42)  # very large C: almost no regularization
log_reg.fit(X, y)
x0, x1 = np.meshgrid(
    np.linspace(2.9, 7, 500).reshape(-1, 1),
    np.linspace(0.8, 2.7, 200).reshape(-1, 1),
)
X_new = np.c_[x0.ravel(), x1.ravel()]
y_proba = log_reg.predict_proba(X_new)
plt.figure(figsize=(10, 4))
plt.plot(X[y==0, 0], X[y==0, 1], "bs")
plt.plot(X[y==1, 0], X[y==1, 1], "g^")
zz = y_proba[:, 1].reshape(x0.shape)
contour = plt.contour(x0, x1, zz)
left_right = np.array([2.9, 7])
boundary = -(log_reg.coef_[0][0] * left_right + log_reg.intercept_[0]) / log_reg.coef_[0][1]
plt.clabel(contour, inline=1, fontsize=12)
plt.plot(left_right, boundary, "k--", linewidth=3)
plt.text(3.5, 1.5, "Not Iris-Virginica", fontsize=14, color="b", ha="center")
plt.text(6.5, 2.3, "Iris-Virginica", fontsize=14, color="g", ha="center")
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.axis([2.9, 7, 0.8, 2.7])
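The dashed boundary line in the plot above comes from setting the model's linear score to zero: the estimated probability is 0.5 exactly when θ0 + θ1·x1 + θ2·x2 = 0, so x2 = −(θ1·x1 + θ0) / θ2. The sketch below checks this with plain Python. The parameter values are hypothetical and chosen only for illustration; they are not the fitted coefficients.

```python
import math

def sigmoid(t):
    """Logistic function mapping a score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

# Hypothetical parameters (intercept, petal-length weight, petal-width weight).
theta0, theta1, theta2 = -45.0, 5.0, 10.0

def boundary_x2(x1):
    """Petal width at which the estimated probability is 0.5 for a given petal length."""
    return -(theta1 * x1 + theta0) / theta2

boundary_probs = []
for x1 in (2.9, 5.0, 7.0):
    x2 = boundary_x2(x1)
    p = sigmoid(theta0 + theta1 * x1 + theta2 * x2)
    boundary_probs.append(p)
    print(x1, x2, p)
```

Every point on the line has probability 0.5 (up to floating-point rounding); points above it are classified as Iris-Virginica when θ2 is positive.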
# iris classification using Softmax
X = iris["data"][:, (2, 3)] # petal length, petal width
y = iris["target"]
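# Note: multi_class is deprecated in recent scikit-learn releases, where the lbfgs solver already uses the multinomial loss for multiclass targets.
softmax_reg = LogisticRegression(multi_class="multinomial", solver="lbfgs", C=10, random_state=42)
softmax_reg.fit(X, y)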
x0, x1 = np.meshgrid(
    np.linspace(0, 8, 500).reshape(-1, 1),
    np.linspace(0, 3.5, 200).reshape(-1, 1),
)
X_new = np.c_[x0.ravel(), x1.ravel()]
y_proba = softmax_reg.predict_proba(X_new)
y_predict = softmax_reg.predict(X_new)
zz1 = y_proba[:, 1].reshape(x0.shape)
zz = y_predict.reshape(x0.shape)
plt.figure(figsize=(10, 4))
plt.plot(X[y==2, 0], X[y==2, 1], "g^", label="Iris-Virginica")
plt.plot(X[y==1, 0], X[y==1, 1], "bs", label="Iris-Versicolor")
plt.plot(X[y==0, 0], X[y==0, 1], "yo", label="Iris-Setosa")
from matplotlib.colors import ListedColormap
custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])
plt.contourf(x0, x1, zz, cmap=custom_cmap)
contour = plt.contour(x0, x1, zz1)
plt.clabel(contour, inline=1, fontsize=12)
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.legend(loc="center left", fontsize=14)
plt.axis([0, 7, 0, 3.5])
softmax_reg.predict([[5, 2]])
softmax_reg.predict_proba([[5, 2]])
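The probabilities returned by predict_proba come from applying the softmax function to one score per class. Here is a minimal pure-Python sketch of that function; the scores are hypothetical values picked for illustration, not the scores the fitted model computes for [5, 2].

```python
import math

def softmax(scores):
    """Turn a list of per-class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp without changing the result
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for the three iris classes (Setosa, Versicolor, Virginica).
scores = [-2.0, 1.0, 3.0]
probs = softmax(scores)
predicted_class = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
print(probs, predicted_class)
```

The predicted class is simply the one with the highest score, since softmax preserves the ordering of its inputs.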